Parametric annealing: a stochastic search method for human pose tracking 
Abstract

Model-based methods for marker-free motion capture
have a very high computational overhead that makes them
unattractive. In this paper we describe a method that
improves on existing global optimization techniques for
tracking articulated objects. Our method improves on the state-
of-the-art Annealed Particle Filter (APF) by reusing sam- 
ples across annealing layers and by using an adaptive para- 
metric density for diffusion. We compare the proposed 
method with APF on a scalable problem and study how the 
two methods scale with the dimensionality, multi-modality 
and the range of search. Then we perform sensitivity analy- 
sis on the parameters of our algorithm and show that it tol- 
erates a wide range of parameter settings. We also show re- 
sults on tracking human pose from the widely-used Human 
Eva I dataset. Our results show that the proposed method 
reduces the tracking error despite using less than 50% of the
computational resources used by APF. The tracked output also
shows a significant qualitative improvement over APF as 
demonstrated through image and video results. 

1. Introduction 

Tracking an articulated object like a human body with 
many degrees of freedom is an active research area in the 
vision community. A large body of research on human pose 
tracking and estimation suited to different application
areas is found in the literature [12]. The model-based
generative approach to tracking humans [14, 5, 6, 16, 3, 13, 17, 8]
is a critical method with applications in animation, sports and
medical motion analysis. Because it relies on a kinematic model
to support the tracked hypothesis, it remains one of the most
accurate ways to track a human. However, this accuracy
comes with a significant computational overhead.

Model based generative methods model a human by a 
kinematic tree controlled by a fixed set of parameters. The 
imaging process is modeled by a projection operation and 
pose estimation is formulated as a state estimation problem 
that aims to minimize the disparity between the actual
observation and the generated image as a function of the
parameters of the kinematic model. Every evaluation of the
disparity function incurs a significant computational over- 
head which can be considered to be of the order of the res- 
olution and the number of cameras used for tracking. 

Existing literature on the method approaches the
problem either as a Bayesian filtering problem [ ] or as
an optimization problem. Since the observation is not a
random vector correlated with the hidden state, as is the 
case in filtering problems, Bayesian methods either resort 
to local optimization [3, 16] to recover and track the modes 
or choose a nonparametric method like particle filter [14]. 
In practice, however, both these methods need a significant 
number of evaluations of the disparity function, resulting in 
a high computational overhead to track. 

Alternate approaches [8, 13, 7] formulate the problem as 
a search for the global mode of the likelihood assuming the 
disparity to be the negative log likelihood. This line of
techniques originates from the Annealed Particle Filter (APF)
[7]. APF is a generic procedure applicable to several types
of input data including silhouettes from video streams and
3D reconstructions of humans. Moreover, it does not rely on
learned prior dynamics [15], hence it remains an attractive 
algorithm and is largely considered to be the state-of-the-art 
in model-based human pose tracking. 

APF reveals that annealing can be a powerful tool when 
applied to high dimensional multi-modal problems. How- 
ever, we observe that the full potential of annealing is not 
realized by the APF. Since the cost of each likelihood eval- 
uation is very high, there is a need to extract as much infor- 
mation as possible from each sample. Our procedure, which 
we refer to as parametric annealing (PA), is aimed at reduc- 
ing the total number of samples by reusing samples across 
annealing layers and by using an adaptive parametric den- 
sity to generate new samples. Our experiments show that it 
is capable of tracking more accurately with less than 50% 
of the number of samples required by APF. In this paper, 
we study various properties of our method including how it 
scales with the number of modes and dimensions, and also 
examine its sensitivity to parameter settings. A preliminary
version of our paper appears in [10].

Figure 1: The plots show the effect of annealing on a Mixture
of Gaussians as the temperature T is varied from ∞ to
0. The annealing temperature is displayed below each plot.

The rest of the paper is organized as follows. Section 
2 provides a background introduction to annealed particle 
filters. Section 3 highlights the problems with APF and mo- 
tivates the need for our method. Section 4 provides a de- 
tailed description of our method. Section 5 compares the 
proposed method with APF on a simple scalable problem 
and performs sensitivity analysis on the parameters of our 
method. Following that, we provide a summary of the tracking
results using data from the Human Eva I dataset. Section 6
outlines future work and concludes the paper.

2. Background 

In this section we review the background of annealed 
particle filters to motivate the need for our method and to 
enable the exposition of it. 

Particle filter [ ] is a well-known technique in computer
vision for tracking. It propagates the uncertainty about the
tracked object nonparametrically using a set of weighted
samples. Let $x_t$ and $z_t$ be random variables corresponding
to the state of the object and the observation at time $t$. Let
$\psi$ be the set of $N$ samples $x_i$ and their corresponding
normalized weights $\pi_i$ that represent the distribution of
$x_t$. The notation $z_{t:1}$ is used to indicate the set of
observations till time $t$. Formally

$$p(x_t \mid z_{t:1}; \psi) = \sum_{i=1}^{N} \pi_i \, \delta(x_t - x_i) \qquad (1)$$

Applying the first order Markov and the sensor assumptions
[ ] one can obtain the posterior at time $t + 1$ by multiplying
the likelihood with the prior, where the prior is obtained
by convolving the posterior from time $t$ with the motion prior.
Assuming the motion prior $p(x_{t+1} \mid x_t)$ to be a combination
of a function $f(x_t)$ and Gaussian uncertainty of covariance
$\Sigma$ results in the prior density shown below

$$p(x_{t+1} \mid z_{t:1}; \psi) = \sum_{i=1}^{N} \pi_i \, \mathcal{N}(f(x_i); \Sigma) \qquad (2)$$

Particle filters use the prior density as a proposal density
for importance sampling [ ] from the likelihood. Multiplying
the likelihood by the prior provides the posterior at
time t + 1. This results in a four-step procedure for
tracking, i.e. re-sample, drift, diffuse and evaluate [ ]. The first
three steps can be shown to generate samples from the prior 
density (2), and the last step can be shown to perform im- 
portance sampling and the necessary multiplication to ob- 
tain the posterior. The algorithm hinges on the critical as- 
sumption that the prior density has a good overlap with the 
likelihood for the importance sampling to be effective. 

Annealing has the effect of transforming any distribution
from a uniform distribution to a delta function at the global
maximum as the temperature is changed from ∞ to 0. Figure 1
shows the effect of annealing on a multimodal distribution.
Formally, the process of annealing can be expressed
as below

$$f(x, T) \propto p(x)^{1/T} \qquad (3)$$

where $p(x) : \mathbb{R}^d \to \mathbb{R}$ is a non-negative function with a
finite integral and $T$ is the annealing temperature.
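
As a minimal sketch of Eq. (3), the following example anneals a two-component 1D Mixture of Gaussians on a grid. The grid, the mixture parameters and the function name `anneal` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def anneal(p, temperature):
    """Return p**(1/T), renormalized so the grid values sum to one."""
    log_f = np.log(p) / temperature
    f = np.exp(log_f - log_f.max())   # subtract the max for numerical stability
    return f / f.sum()

# Illustrative bimodal density (assumed parameters): global mode near x = -1.
x = np.linspace(-6.0, 6.0, 601)
gauss = lambda m, s: np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
p = 0.6 * gauss(-1.0, 0.6) + 0.4 * gauss(2.5, 0.8)

hot = anneal(p, 100.0)    # high temperature: close to uniform
cold = anneal(p, 1e-4)    # low temperature: mass collapses onto the global mode
```

At a high temperature the annealed density is nearly flat over the grid, while at a low temperature almost all of the mass sits on the grid point closest to the global maximum.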

APF[ ] incorporates annealing into particle filtering by 
exploiting the fact that a distribution at a higher temperature 
effectively has a wider support (as can be observed in Figure 
1), and hence could be used as a proposal distribution for 
importance sampling from the same distribution at a lower 
temperature. Hence in a single iteration (that corresponds to 
a time interval t), it performs several particle filtering steps 
referred to as layers. It assumes $f(x_t)$ to be $x_t$ in successive
layers (since the distribution is not translated by annealing)
and the variance to be gradually decreasing by a fraction $\alpha^m$
(where $m$ is the annealing layer and $\alpha < 1$) to account for
the narrow and peaked objective that results from annealing 
(see Figure 1). Annealing is performed by a simple sched- 
ule by controlling the number of unique resampled particles 
to be approximately half of that which exist in the previous 
layer. Consequently, at the end of a fixed number of layers, 
the set of samples that represented a multimodal density in 
the start represent a delta distribution in the end. Hence the 
expected state of the set of samples approximates the global 
maximum with high accuracy. The high level steps done in 
APF are as follows. 

1. $N$ samples $x_i$, $i \in \{1, \ldots, N\}$, are obtained by
sampling from $\mathcal{N}(x_t; \Sigma)$, where $x_t$ is the estimate of the
state in the last iteration.

2. The likelihood $y_i$ is evaluated at sample $x_i$.

3. A suitable annealing temperature $T^m$ that ensures only
50% of the samples will survive resampling is estimated.

4. Normalized sample weights are obtained, $\pi_i^m = y_i^{\beta^m} / \sum_{j=1}^{N} y_j^{\beta^m}$, where $\beta^m = 1/T^m$.

5. $N$ new samples are generated by sampling (resample
and diffuse) from $\sum_{i=1}^{N} \pi_i^m \, \mathcal{N}(x_i; \alpha^m \Sigma)$.

6. Repeat from step 2 for a fixed number of layers.

7. The global maximum is estimated as the expected state
of the weighted samples.

Figure 2: Test demonstrating effects of reusing samples.
The subfigures a, b and c correspond to the configuration
that discards samples, the one that retains them and the one
that retains them along with a deep annealing schedule. It can
be observed that reusing makes the search less effective (b),
which is overcome by augmenting it with a deep annealing
schedule (c).
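
The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the reference APF implementation: the survival rate is taken to be $1 / (N \sum_i \pi_i^2)$ following the definition in [7], the temperature is found by bisection, the covariance is isotropic, and the names `survival_rate`, `solve_beta`, `apf`, `survival` and `shrink` are hypothetical.

```python
import numpy as np

def survival_rate(log_y, beta):
    """Fraction of particles expected to survive resampling at inverse temperature beta."""
    w = np.exp(beta * (log_y - log_y.max()))
    w /= w.sum()
    return 1.0 / (len(w) * np.sum(w ** 2))

def solve_beta(log_y, target, lo=1e-8, hi=1e4, iters=60):
    """Bisection in log space; the survival rate decreases as beta grows."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if survival_rate(log_y, mid) > target:
            lo = mid
        else:
            hi = mid
    return np.sqrt(lo * hi)

def apf(log_lik, x0, sigma, n=200, layers=5, survival=0.5, shrink=0.5, rng=None):
    """One APF iteration around the previous estimate x0 (sigma is a scalar std. dev.)."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0 + sigma * rng.normal(size=(n, x0.size))             # step 1
    for m in range(1, layers + 1):
        log_y = np.array([log_lik(xi) for xi in x])            # step 2
        beta = solve_beta(log_y, survival)                     # step 3
        w = np.exp(beta * (log_y - log_y.max()))
        w /= w.sum()                                           # step 4
        if m < layers:                                         # steps 5 and 6
            idx = rng.choice(n, size=n, p=w)                   # resample
            noise = rng.normal(size=x.shape)
            x = x[idx] + np.sqrt(shrink ** m) * sigma * noise  # diffuse, covariance shrink^m * Sigma
    return w @ x                                               # step 7: expected state
```

With the default arguments the sketch spends 5 x 200 = 1000 likelihood evaluations per iteration, and every layer discards the evaluations of the layer before it.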

The critical aspect of APF is that it is capable of tracking
with significantly fewer samples. Studies
done on human pose tracking show that APF is capable of
tracking with 1000 [ ] samples in comparison to the 10,000
samples that are required by particle filters [14].

3. Motivation 

The APF algorithm generates $N$ new samples in each
layer. However, samples from previous layers are discarded.
Since the evaluation of the likelihood incurs a significant
overhead, one would want to retain the relevant samples
from previous layers. The most natural way to do this
would be to simply retain the samples $x_i$ along with the
corresponding likelihoods $y_i$, and raise the likelihoods to the
exponent $\beta^m = 1/T^m$ of the current layer before normalization.
However, this does not work well in practice.

To demonstrate the effect of simply reusing samples, we
conducted an experiment. We considered the log likelihood to
be a quadratic, $-\frac{1}{2} x^T x$, where $x \in \mathbb{R}^{30}$. To verify how
well APF searches through the state space, we started from
a fixed state and searched for the optimum. We compared
two configurations of APF: the original configuration that
discards samples, and a second configuration which retains
them by raising the likelihood to the appropriate $\beta^m$ for the
layer before normalization. To measure how effectively the
methods locate the optimum, we obtained the distribution of
the L2 norm of the state estimated by APF. The histograms
of the L2 norm are shown in Figures 2a and 2b. It can be
observed that when samples are discarded, the state estimate
is closer to the true optimum.

The test demonstrates the problems with retaining samples.
The critical insight, however, is that the problems
with reusing samples are more than overcome by choosing
a "deep" annealing schedule that consistently reduces
the annealing temperature, rather than a simple schedule
which ensures 50% of the samples survive resampling. The
precise formulation of a deep annealing schedule as a power
law, which is a novel aspect of our method, is given in
Section 4. The improvement is shown in the histogram in
Figure 2c, which augments retaining samples with our proposed
deep annealing schedule. The result can be best explained
by analyzing another simple example. Figure 3a shows a 1D
Gaussian density in blue. A set of weighted samples was
obtained by diffusing samples around x = 3. The samples
are shown in red and the kernel density approximated
by the samples using a Gaussian kernel is shown in green.
If the samples were to be re-sampled and diffused with the
same Gaussian kernel, it would be equivalent to generating
samples from the density shown in green. Figure 3b shows the
effect of annealing on the kernel density. It can be observed
that annealing effectively shifts the kernel density mass
towards regions that improve the search objective.
Consequently, when enabled with a suitable annealing schedule,
new samples are generated in areas of the state space that
improve the objective. Expressed differently, a good
annealing schedule quickly drives the normalized weights of
worthless samples, which end up in areas of the state space
with a poor objective, to zero in successive layers, so that
they do not adversely impact the search process.

Figure 3: Example demonstrating the effect of annealing
(color). It can be seen how annealing shifts the probability
mass towards regions that improve the objective.

In addition, one would expect to have a slow annealing 
schedule when tracking articulated objects since their like- 
lihoods are multimodal and fast annealing is known to con- 
verge to local optima. APF can be interpreted as generating 
new samples from a kernel density with a Gaussian kernel 
of gradually reducing noise covariance. It is well-known 
that the kernel density approximated by a set of samples is 
highly sensitive to the type of kernel. Therefore such a fixed 
diffusion schedule is very sensitive to the parameters, more 
so when the annealing is slow and longer. This is evident 
from the two-fold reduction [6] in the number of samples 
needed to track when using an adaptive kernel. We take 
the idea of adaptive diffusion density a step further, and use 
an intermediate parametric form which is inferred from the 
set of samples to act as a substitute for the kernel density;
that is, instead of resampling and diffusing with a kernel, new
samples are generated from a parametric model. The motivation
is that, with an intermediate parametric form, the density 
would adapt even better. 

Inferring an intermediate parametric form has an added
advantage when the set of samples in the last layer of
annealing still represents a multimodal density. Such a scenario
can be observed in Figure 4, which shows a set of samples in the
last layer of annealing in red. A parametric density inferred
from the set of samples is shown in green. The parametric
density shows that the samples represent a multimodal
density. It can be observed that the expected state of such a
sample set deviates from the global maximum, and a better
estimate is obtained from the parametric form. These
considerations motivate the need for an improved procedure,
which we refer to as parametric annealing. We describe our
method in detail in the next section.

Figure 4: Example demonstrating the effect of a parametric
form on the state estimate. The samples are shown in
red, and an inferred parametric form is shown in green. The
expected state of all samples, and the maximum of the
parametric form, are shown in black and cyan respectively. It
can be seen how the expected state is a worse estimate of
the true optimum at -1.

4. Parametric Annealing 

Let $P(x)$ ($x \in \mathbb{R}^d$) be a general multi-modal distribution,
and let $\psi$ be the set of $N$ weighted samples $(x_i, \pi_i)$
that represent $P(x)$. Formally,

$$P(x) \approx p(x; \psi) = \sum_{i=1}^{N} \pi_i \, \delta(x - x_i) \qquad (4)$$

Let $q(x; \theta)$ be a Mixture of Gaussians (MoG), parametrized
by $\theta$, that approximates $p(x; \psi)$. We estimate the parameters
of $q(x; \theta)$ using Expectation Maximization (EM) adapted to
include the sample weights $\pi_i$. Formally,

$$q(x; \theta) \approx p(x; \psi) \approx P(x) \qquad (5)$$

With this terminology, the high level steps done in each 
iteration of our algorithm are now described. Subsequently, 
the steps are compared to the APF, and the differences are 
highlighted. 

1. $N^0$ samples $x_i$, $i \in \{1, \ldots, N^0\}$, are obtained by
sampling from $\mathcal{N}(x_t; \Sigma)$, where $x_t$ is the estimate of the
state in the last iteration.

2. The likelihood $y_i$ is evaluated for new samples $x_i$.

3. A suitable annealing temperature $T^m$ is estimated using
a deep annealing schedule.

4. Normalized sample weights $\pi_i^m = y_i^{\beta^m} / \sum_{j=1}^{N^m} y_j^{\beta^m}$,
where $\beta^m = 1/T^m$, are estimated.

5. A parametric MoG approximation $q(x; \theta^m)$ is inferred
from the weighted samples $\psi$ using one EM iteration.

6. $C$ new samples $x_i$ are generated from the parametric
model and combined with old samples.

7. Repeat from step 2 for $M$ layers to simulate a very
slow annealing.

8. The global maximum is estimated from the parametric
model.

The main novelty of our procedure over APF is twofold. 

• We don't discard samples in step 6, and moreover, 
we use a deep annealing schedule to enable effective 
search in step 3. 

• We use a parametric model in steps 5 & 6 instead of the 
kernel diffusion density, and obtain the estimate from 
the parametric form in step 8. 
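
The eight steps can be sketched as follows. This is a simplified illustration and not the authors' implementation. It assumes diagonal covariances, a target survival rate of `eta ** m` and a regularizer of `lam ** m * sigma ** 2` as stand-ins for the power law schedule, and it takes the mean of the heaviest mixture component as the estimate in step 8. All function and argument names are hypothetical.

```python
import numpy as np

def beta_for_survival(log_y, target, iters=60):
    """Bisection for beta such that 1 / (N * sum(pi_i^2)) matches the target."""
    lo, hi = 1e-8, 1e6
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        w = np.exp(mid * (log_y - log_y.max()))
        w /= w.sum()
        if 1.0 / (len(w) * np.sum(w ** 2)) > target:
            lo = mid
        else:
            hi = mid
    return np.sqrt(lo * hi)

def em_step_diag(x, w, mu, var, phi, reg):
    """One weighted EM iteration for a diagonal-covariance MoG."""
    diff2 = (x[:, None, :] - mu[None]) ** 2                       # (N, C, d)
    log_r = np.log(phi)[None, :] - 0.5 * np.sum(
        diff2 / var[None] + np.log(2 * np.pi * var[None]), axis=2)
    r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)                             # E step
    wr = w[:, None] * r                                           # sample weights enter here
    mass = wr.sum(axis=0)
    alive = mass > 1e-12                                          # skip empty components
    mu, var = mu.copy(), var.copy()
    mu[alive] = (wr.T @ x)[alive] / mass[alive, None]
    sq = (x[:, None, :] - mu[None]) ** 2
    var[alive] = np.einsum('nc,ncd->cd', wr, sq)[alive] / mass[alive, None] + reg
    phi = (mass + 1e-12) / (mass + 1e-12).sum()
    return mu, var, phi

def parametric_annealing(log_lik, x0, sigma, n0=150, layers=24, c=12,
                         eta=0.8, lam=0.8, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    d = x0.size
    x = x0 + sigma * rng.normal(size=(n0, d))                     # step 1
    log_y = np.array([log_lik(xi) for xi in x])                   # step 2
    mu = x[rng.choice(n0, size=c, replace=False)].copy()
    var = np.full((c, d), float(sigma) ** 2)
    phi = np.full(c, 1.0 / c)
    for m in range(1, layers + 1):
        beta = beta_for_survival(log_y, eta ** m)                 # step 3 (assumed schedule)
        w = np.exp(beta * (log_y - log_y.max()))
        w /= w.sum()                                              # step 4
        reg = lam ** m * sigma ** 2                               # assumed regularizer
        mu, var, phi = em_step_diag(x, w, mu, var, phi, reg)      # step 5
        comp = rng.choice(c, size=c, p=phi)                       # step 6
        new = mu[comp] + np.sqrt(var[comp]) * rng.normal(size=(c, d))
        x = np.vstack([x, new])                                   # old samples are kept
        log_y = np.concatenate([log_y, [log_lik(xi) for xi in new]])
    return mu[np.argmax(phi)], len(x)                             # step 8
```

Only the `c` new samples are evaluated in each layer, so with the defaults the sketch spends 150 + 24 x 12 = 438 likelihood evaluations in one iteration.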

Below, we provide details of the method, which may be 
skipped by those readers who want to consider the results 
in the next section. 

The high level steps above include some of the parameters
of the algorithm. The superscript $m$ in any parameter
indicates that it is a function of the layer, $m \in \{1, \ldots, M\}$,
where $M$ is the number of annealing layers. The parameter
$N^m$ represents the total number of samples in layer $m$.
As opposed to APF, where the number of samples in each
layer remains fixed, the number of samples in our method
grows in successive layers since we don't discard samples.
As we introduce $C$ new samples in a layer and we start with
$N^0$ samples, the number of samples in each layer is
$N^m = N^0 + mC$, bringing the total number of samples for an
iteration to $N^0 + MC$.

We fix the number of mixture components in the MoG
$q(x; \theta)$ to $C$, which is also the number of new samples
introduced in a layer. Consequently, the parameter $\theta^m$ is
the set of $C$ means ($\mu_c^m$), covariances ($\Sigma_c^m$) and
mixture weights ($\phi_c^m$). The subscript $c$ and the superscript $m$
indicate that these parameters are dependent on the specific
mixture component $c$ and the layer $m$. The "E & M"
steps [ ] in the EM algorithm are well known. A regularizer
$\xi^m$ is included in the M step update for $\Sigma_c^m$ to ensure that
the covariance matrices are full rank. Since we are also
annealing the samples while inferring the parametric form we
find that a gradually reducing regularizer is more effective
than a fixed regularizer.
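
A minimal sketch of one EM iteration adapted to weighted samples is given below. It assumes that the sample weights simply multiply the responsibilities in the M step and that the regularizer is added to each covariance update. The paper does not spell out these details, so both are assumptions, and the function name `weighted_em_step` is hypothetical.

```python
import numpy as np

def weighted_em_step(x, w, means, covs, phis, reg):
    """One EM iteration for a full-covariance MoG on weighted samples (x_i, pi_i)."""
    n, d = x.shape
    c = len(phis)
    # E step: log responsibilities, log(phi_c) + log N(x_i; mu_c, Sigma_c)
    log_r = np.empty((n, c))
    for k in range(c):
        chol = np.linalg.cholesky(covs[k])
        z = np.linalg.solve(chol, (x - means[k]).T)        # whitened residuals, (d, n)
        log_det = 2.0 * np.log(np.diag(chol)).sum()
        log_r[:, k] = np.log(phis[k]) - 0.5 * (
            np.sum(z ** 2, axis=0) + log_det + d * np.log(2 * np.pi))
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M step: every sample contributes in proportion to its weight pi_i
    wr = w[:, None] * r
    mass = np.maximum(wr.sum(axis=0), 1e-300)
    new_phis = mass / mass.sum()
    new_means = (wr.T @ x) / mass[:, None]
    new_covs = np.empty_like(covs)
    for k in range(c):
        diff = x - new_means[k]
        # the regularizer keeps the covariance full rank
        new_covs[k] = (wr[:, k, None] * diff).T @ diff / mass[k] + reg
    return new_means, new_covs, new_phis
```

Passing a smaller `reg` matrix in later layers would mimic the gradually reducing regularizer described above.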

The annealing schedule is a critical part of our algorithm.
We use a measure similar to the particle survival rate $\alpha$ defined
in [7] to control the annealing schedule. The parameter $T^m$,
which is the annealing temperature for the layer $m$, can be
estimated from a predefined sequence of $\alpha^m$ by any local
optimization technique. The parameters $\alpha^m$ and $\xi^m$
(regularizer) define the annealing schedule. We define these
parameters by the following power law



(6) 



where $\xi$ is a $d \times d$ diagonal matrix with each diagonal
element equaling half the maximum change in state estimate
along that dimension. The significant difference between
our schedule and that used by APF is that APF sets the
parameter $\alpha^m$ to 0.5 for all $m$. Thus $\eta \in (0, 1)$, $\lambda \in (0, 1)$,
$N^0$, $M$ and $C$ are the external parameters used by our
method. These parameters are dependent upon the state
space dimension $d$ and the density $P(x)$.

5. Analysis and comparison 
5.1. Scaling properties 

In this section we analyze how our method scales with
the dimensionality, multi-modality and the range of search
by comparing it with APF on a simple scalable problem.
The problem used for the comparison is to let the two
stochastic procedures search for the optimum state in a MoG.
Inspired by [ ], we used a generative method to model the
MoG used as the search objective. We assume the search
objective $\theta$ (which is a MoG) to be a random draw from a
distribution $P(\theta)$. The distribution $P(\theta)$ is sampled using
a generative method, i.e., the mixture weights are drawn
from a Dirichlet distribution with a uniform prior, the inverse
covariances are drawn from a Wishart distribution with the
$d$ dimensional identity as scale matrix, and the mean vectors are
drawn from a $d$ dimensional Gaussian with a scaled identity
matrix (by scale $s$) as covariance and zero mean.
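
The generative procedure can be sketched as below. The Wishart degrees of freedom are not stated in the text, so `dof = d + 2` is an assumption, and the function names are hypothetical. A Wishart draw with identity scale is formed as $A A^T$ with $A$ a $d \times \text{dof}$ matrix of standard normal entries.

```python
import numpy as np

def sample_mog_objective(d, k, s, rng, dof=None):
    """Draw a random k-component MoG in d dimensions as the search objective."""
    dof = d + 2 if dof is None else dof               # assumed degrees of freedom
    phi = rng.dirichlet(np.ones(k))                   # uniform Dirichlet prior
    mu = rng.normal(scale=np.sqrt(s), size=(k, d))    # zero mean, covariance s * I
    a = rng.normal(size=(k, d, dof))
    prec = a @ a.transpose(0, 2, 1)                   # Wishart(I_d, dof) inverse covariances
    return phi, mu, prec

def mog_log_density(x, phi, mu, prec):
    """Log density of the MoG at a single point x, via log-sum-exp."""
    diff = x - mu                                     # (k, d)
    quad = np.einsum('kd,kde,ke->k', diff, prec, diff)
    _, log_det = np.linalg.slogdet(prec)              # log |inverse covariance|
    log_comp = (np.log(phi) + 0.5 * log_det - 0.5 * quad
                - 0.5 * x.size * np.log(2 * np.pi))
    top = log_comp.max()
    return top + np.log(np.exp(log_comp - top).sum())
```

Increasing `d`, `k` or `s` in this sketch raises the dimensionality, the number of modes or the spread of the modes, respectively.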

The variable $d$, the number of mixture components $k$
and the scale of the covariance $s$ control the dimensionality,
multimodality and search range of the objective. We made
both procedures start from the origin of the Euclidean
space $\mathbb{R}^d$ and search for the optimum. We considered the
difference in the log likelihood between the end state and
the start state to be a random variable $I$ (for improvement
in the log likelihood) that acts as an indicator of how well
the procedure performs. Since the procedures are stochastic
in nature, we ran each procedure several times on different
samples from $P(\theta)$ and considered the mean and standard
deviation of $I$ to be indicative of the performance.

Figure 5 plots the scaling properties of the proposed
method in comparison to the APF. We compared our
method to two configurations of APF. The configurations
are APF 1000 ($M = 5$, $N = 200$) and APF 500 ($M = 5$,
$N = 100$), where $M$ and $N$ are annealing layers and
particles per layer respectively. Our method was configured
with 438 samples ($438 = N^0 + M \cdot C = 150 + 24 \cdot 12$).
Figure 5: Scaling properties of the PA (blue) in comparison
to the APF 1000 (green) and APF 500 (red). The plots
show the mean (top row) and deviation (bottom row) of
improvement in log likelihood for different values of
dimension d, number of components k and range parameter s
respectively. It can be observed that our method improves
the likelihood more than APF, and that the deviation in
improvement is almost the same.

The mean and standard deviation of the random variable $I$ are
plotted against various values of dimension $d$, components
$k$ and scale factor $s$. A higher mean value of $I$ and a lower
deviation are better. The result might seem to show that all
procedures somehow perform better in higher dimensions and
as the range increases, but this is not true. It is caused by
the vast change in the log likelihood in higher dimensional
Gaussian mixtures and over long search ranges. However,
it can be observed that, in all three tests, our method
improves the log likelihood more than APF despite using less
than 50% of the samples. It performs either as well as or
slightly better than APF with regard to the deviation of
improvement.

5.2. Sensitivity analysis 

Both our method and APF have quite a number
of parameters. A standard set of parameters proposed in
[ ] is used in most studies involving APF [15]. However,
how these parameters affect the tracking performance has
not been studied. This motivated us to perform sensitivity
analysis on the parameters of our own algorithm, since
it would provide insight into how those parameters affect
tracking. We performed this study on the same objective as
the previous experiment. The test was performed by varying
each of the parameters $\eta$, $\lambda$, $N^0$, $M$ and $C$ over a small
range around the default value used in the previous test and
observing the mean and deviation of the variable $I$. Figure 6 shows
the results of the sensitivity analysis. It can be observed that our
algorithm is largely insensitive to the initial number of
samples $N^0$. In fact, it slightly improves as the initial number
of samples is reduced, which may be understood by analyzing
Figure 3. When the initial number of samples is higher, the
search is ineffective since the samples hold back the search
procedure until they are annealed to a suitable temperature.
Similar observations were made with human pose tracking,
suggesting one might reduce the initial number of samples
without losing performance.

Figure 6: The plot shows the mean improvement and deviation for various settings of the parameters of our procedure. The
parameters $N^0$, $M$ and $C$ refer to the initial number of samples, the number of annealing layers and the number of new samples
introduced per layer respectively. The parameters $\eta$ and $\lambda$ define the power law used to control the rate of annealing.

An increase in the number of annealing layers $M$
improves the performance, which justifies our choice of slow
annealing. However, it can be observed that the deviation shows a
slight decreasing trend as the number of layers is reduced. We
believe this is due to the cumulative effect of the randomness
introduced in every layer. Hence a suitable trade-off
between the two is necessary. An increase in the number of
new samples $C$ introduced in a layer improves the
performance; this is expected, since the search should be better
when there are more samples. A decrease in the parameters
$\eta$ and $\lambda$ that define the annealing schedule shows an
improvement in performance, indicating that the deeper the
annealing, the better the performance of the search
procedure. This is expected since, as shown in Section 3, a deep
annealing schedule is necessary to ensure that the new
samples search the state space effectively.

5.3. Tracking Results 

We compared the proposed method with two configurations
of APF: a) a 1000 particle configuration with 200 particles
per layer and 5 layers, and b) a 500 particle configuration with
100 particles per layer and 5 layers. Our method was
configured with 438 samples ($N^0 = 150$, $M = 24$, $C = 12$).
We used data from the Human Eva I dataset for the
comparison. We review the overall results here; they will be
presented in detail elsewhere. Table 1 presents the results
of the comparison on 6 different video sequences. Each
tracking method was executed 10 times on a sequence, and
the time and ensemble average of the tracking error and the
deviation are shown in the table. We observe that, for all
videos except one, our method has the lowest tracking
error. Furthermore, the difference is significant for the jogging
sequences. It is notable that our method reduces the tracking
error despite the fact that it uses about half the number of
samples, and requires about half the runtime, of APF 1000.

Figure 7 shows the tracked model from all three methods
superimposed on images used for tracking. We used
the Subject 1 jogging sequence for the figure since, as is
evident from Table 1, it is the sequence that most clearly
discriminates the three configurations. It can be observed
that the tracked output from our method is visibly closer to
the subject.

6. Conclusion & Future Work 

In this paper we described a procedure that improves on 
APF for tracking a human from a video sequence. Using 
synthetic examples we demonstrate the critical problems 
with APF and show how they are overcome by the novel as- 
pects of our procedure, which include reusing samples with 
a suitably deep annealing schedule, and also by inferring 
a parametric form from the samples. We then compared
the proposed method to APF on a simple scalable problem
and showed that our method consistently performs better than
APF. This was followed by sensitivity analysis showing that
our algorithm's performance is largely insensitive to the
parameter settings. Finally, we present human pose tracking
results using data from the Human Eva I dataset that show 
the benefit of using our algorithm. We plan to explore tech- 
niques to optimally estimate the parameters for tracking and 
to understand how various attributes like the frame rate and 
the articulated motion affect the tracking performance. 
Table 1: Overall statistics. Best results are shown in bold.
Figure 7: Qualitative comparison of tracked output on the Subject 1 jogging sequence. The first, second and third rows correspond to
APF 500, APF 1000 and PA 438 respectively. The model fit can be observed to be worst for APF 500, better for APF 1000, 
and best for PA 438. 